Every day, it is estimated that the world produces roughly 2.5 quintillion bytes of information, a large fraction of which is text data. These data come in the form of invoices, contracts, proposals, news articles, scientific journals, and ebooks, among countless other sources. With such massive volumes of text being collected all the time, it's often impossible for a human to parse through all of it manually in an efficient way. As such, there is a great need for automated text processing methods to tag and cluster documents, extract meaning from them, and, in some cases, identify their sentiment.
The ability to process text automatically is especially important in the business setting, where product reviews and customer feedback are extremely valuable in helping shape a brand or product to better fit consumer needs. According to this source, approximately 80% of potential customers trust online reviews as much as personal recommendations from friends or family members, and a whopping 88% of them incorporate product reviews into their decision of whether or not they will ultimately purchase an item. Being able to quickly weed through and process thousands of comments and reviews to find out what people are saying about a product can give a business a significant advantage over their competitors.
In this project we'll check out some of the most commonly used topic modeling and document clustering methods employed today. These include techniques like $n$-gram analysis, part-of-speech tagging, term frequency$-$inverse document frequency, k$-$means clustering, and latent Dirichlet allocation. The text data used for this project consist of news article descriptions scraped from various sources around the world like The Guardian, Bloomberg, and the Associated Press, among many others. This project was heavily influenced by the excellent data science blog post written by Ahmed Besbes, which can be found here.
To get the news article data that we'll use in this project, we can make use of the handy newsapi.org API, which happens to be very simple to use. On the newsapi homepage, you'll see a "GET API KEY" link in the upper right-hand corner. After clicking this link, you'll be asked to provide an email address and a password...and that's all there is to it! Well, almost. You'll be presented with something called an "API key", which is a long series of numbers and letters. This is sort of like a password, and is unique to each individual user. To try the API, simply type:
https://newsapi.org/v1/articles?source={NEWSSOURCE}&apiKey={APIKEY}
in the search bar. Before doing so, however, replace {APIKEY} with your own API key, and replace {NEWSSOURCE} with a news source like Bloomberg. So our link would then become something like:
https://newsapi.org/v1/articles?source=bloomberg&apiKey=633qw20c29984ce08504f5c89c2cmm02
Note that the particular API key in the above example does not work; it's just meant to demonstrate how this is done in practice. The output looks something like this:
import requests
import json

site = 'https://newsapi.org/v1/articles?source=bloomberg&apiKey=636cf20c29984ce08504f5c89c2cee02'
source_data = requests.get(site, allow_redirects=True).json()
print(json.dumps(source_data, indent=2)[:1890])
Each output is a set of nested dictionaries with information regarding the articles returned from the source, including fields like the author, title, description, URL, and publication date.
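As a quick illustration, pulling individual fields out of a response shaped like the one above is just dictionary access (the JSON below is a made-up stand-in for real API output):

```python
# Hypothetical response in the shape returned by newsapi v1;
# the values here are invented for illustration only.
source_data = {
    "status": "ok",
    "source": "bloomberg",
    "articles": [
        {"author": "Jane Doe",
         "title": "Example headline",
         "description": "An example article description.",
         "url": "https://example.com/article-1",
         "publishedAt": "2017-05-30T12:00:00Z"},
    ],
}

# Each article is a dictionary, so its fields can be accessed by key
for article in source_data["articles"]:
    print(article["title"], "->", article["url"])
```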
One of the great things about newsapi is that this sort of data can be collected from over 70 news sources and blogs from various parts of the world, some of which are not even in English. In order to fetch the data like we've done above for many news sources, we need a way to automate the process. In general, the tasks we need to perform are:

1. Loop over each news source of interest, grabbing the top, latest, and popular articles from each via the API.
2. Extract the relevant fields (publication date, author, category, title, description, and URL) from each response.
3. Write the results to file, dropping any duplicate articles along the way.
To this end, I wrote a Python module called scrapeNews.py to automatically perform the tasks listed above. I should note again that Ahmed Besbes' data science blog was an outstanding source of information for how to do this.
import pandas as pd
import requests
import os.path
import csv

API_key = '636cf20c29984ce08504f5c89c2cee02'
outdir = '/Users/degravek/Downloads/'


def articleType():
    sort_by = ['top', 'latest', 'popular']
    source_id = ['associated-press', 'bbc-news', 'bbc-sport', 'bloomberg',
                 'business-insider', 'cnbc', 'cnn', 'daily-mail', 'engadget',
                 'entertainment-weekly', 'espn', 'financial-times', 'fortune',
                 'four-four-two', 'fox-sports', 'google-news', 'hacker-news', 'mtv-news',
                 'national-geographic', 'new-scientist', 'newsweek', 'nfl-news', 'reuters',
                 'talksport', 'techcrunch', 'techradar', 'the-economist',
                 'the-guardian-uk', 'the-huffington-post', 'the-new-york-times',
                 'the-next-web', 'the-sport-bible', 'the-telegraph', 'the-verge',
                 'the-wall-street-journal', 'the-washington-post', 'time', 'usa-today',
                 'ars-technica', 'al-jazeera-english']
    return source_id, sort_by


def getCategories():
    # Build a dictionary mapping each source id to its newsapi category
    site = 'https://newsapi.org/v1/sources'
    source_info = requests.get(site).json()
    source_categories = {}
    for element in source_info['sources']:
        source_categories[element['id']] = element['category']
    return source_categories


def writeData(output):
    file_name = outdir + 'news_articles.csv'
    df = pd.DataFrame(output, columns=['publishedAt', 'author', 'category', 'title',
                                       'description', 'url'])
    # Merge with any previously saved articles before deduplicating on url
    if os.path.isfile(file_name):
        articles = pd.read_csv(file_name)
        df = df.append(articles)
    df = df.drop_duplicates(subset='url')
    df.to_csv(file_name, mode='w', encoding='utf-8', index=False, header=True)


def scrapeNews():
    source_id, sort_by = articleType()
    source_categories = getCategories()
    output = []
    for sid in source_id:
        for sb in sort_by:
            # Get news article json
            site = 'https://newsapi.org/v1/articles?source=' + sid + '&sortBy=' + \
                   sb + '&apiKey=' + API_key
            source_data = requests.get(site).json()
            try:
                for element in source_data['articles']:
                    if not element['author']:
                        element['author'] = 'no_author'
                    output.append([element['publishedAt'], element['author'],
                                   source_categories[sid], element['title'],
                                   element['description'], element['url']])
            except (KeyError, TypeError):
                # Skip responses with no usable 'articles' field
                pass
    writeData(output)


if __name__ == '__main__':
    scrapeNews()
The code above might look daunting at first, but it's actually not so bad. At the top, I define my API key and the directory where my file containing news articles will be saved. The function articleType() defines which news sources we're interested in grabbing articles from. From these sources, scrapeNews() will then grab the top, latest, and popular articles.
Every news source available is given a "category" by newsapi. For example, Bloomberg is given a "general" tag as this source generally covers many different news topics, while ESPN is given a "sport" tag. One will note, though, that in our example earlier, category was not a key in the dictionary. The categories for each source can instead be found here:
https://newsapi.org/v1/sources
The function getCategories() loops over every possible news source and grabs the corresponding category from the list. The result is a dictionary where the first ten entries look something like this:
site = 'https://newsapi.org/v1/sources'
source_info = requests.get(site).json()
source_categories = {}
for element in source_info['sources']:
    source_categories[element['id']] = element['category']

dict(zip(list(source_categories.keys())[:10], list(source_categories.values())[:10]))
The function scrapeNews() then loops over each news source we've decided to use, finds its corresponding category using the dictionary above, and then grabs all the relevant news article information from that source using our API key.
Because we're collecting the top, latest, and popular articles for each source, there are bound to be duplicate articles included when we're done. Therefore, before writing the data to file, we load in any previously saved version of the file and drop duplicate entries based on the article url column. All of this is done in writeData().
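As a toy illustration of that deduplication step (the URLs below are hypothetical), dropping duplicates on the url column works like this:

```python
import pandas as pd

# A new batch of scraped articles and a previously saved batch,
# sharing one article (same url)
df_new = pd.DataFrame({'url': ['ex.com/a', 'ex.com/b'], 'title': ['A', 'B']})
df_old = pd.DataFrame({'url': ['ex.com/a'], 'title': ['A']})

# Concatenate the two, then keep only the first row seen for each url
combined = pd.concat([df_new, df_old]).drop_duplicates(subset='url')
print(len(combined))  # the shared article is kept only once
```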
We can run scrapeNews.py easily by calling python scrapeNews.py from the terminal command line. One will notice though that the number of news articles scraped after this run is only on the order of a few hundred or so $-$ not nearly enough for our purposes. We could manually run the code every 15 minutes or so, but that would be pretty tedious. To get around this, we can set up something called a "cron job" to execute the script for us automatically over fixed time intervals.
Setting up a cron job is pretty easy. In our case (working on a Mac), we need to follow three steps:

1. Open a terminal window.
2. Type crontab -e to open the cron table for editing.
3. Enter a line specifying how often the script should run and where it's located, then save the file.
For step (3), the input will look something like:
* * * * * ~/anaconda/bin/python /Users/degravek/Site/news/scrapeNews.py
The second term in the line above (~/anaconda/bin/python) is the path to my Python distribution, while the third term (/Users/degravek/Site/news/scrapeNews.py) is the absolute path to where scrapeNews.py is located. The five asterisks are placeholders that determine how frequently you want the code to be run, and their meaning is described well by the diagram below.

For this project, I ran scrapeNews.py every 15 minutes for several days, so my input looked like
*/15 * * * * ~/anaconda/bin/python /Users/degravek/Site/news/scrapeNews.py
When enough data has been collected, simply run crontab -e again and delete the line (or run crontab -r to clear the cron table entirely) to stop the cron job.
Okay, now let's start looking at the data! Before we get started, let's import some useful Python libraries that will come in handy later.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from sklearn.manifold import TSNE
from collections import Counter
from string import punctuation
import pandas as pd
import numpy as np
import nltk
import re
import matplotlib.pyplot as mp
%matplotlib inline
from plotly.offline import download_plotlyjs
from plotly.offline import init_notebook_mode
from plotly.offline import plot, iplot
import cufflinks as cf
cf.go_offline()
stoplist = stopwords.words('english')
We now load in the csv containing the news article data, and drop any duplicates that may have gotten through.
df = pd.read_csv('/Users/degravek/Downloads/news_articles.csv')
df = df.drop_duplicates(subset='url')
df = df.drop_duplicates(subset='title')
df = df.reset_index(drop=True)
df = df[df['description'].notnull()].reset_index(drop=True)
df.loc[df['category'] == 'science-and-nature', 'category'] = 'science-nature'
Let's see what the data look like.
df.head()
As discussed earlier, we have collected the publication date, author, category, title, description, and URL of each article.
We can check out the distribution of categories. I use plotly and cufflinks to do this, as they produce nice interactive figures with function calls similar to those of Pandas.
df['category'].value_counts().iplot(kind="bar", xTitle='Category', color='blue',
yTitle='Counts', dimensions=(790,550), margin=(100,100,60,0))
In all, the news articles fall into seven categories.
General is sort of a catch-all category (often containing political and world news), and therefore has many more articles than the other categories; there are also more of these general-interest sources included in scrapeNews.py. Among the seven categories, we see that Science & Nature and Music make up a small fraction of the total number of articles. Let's take a look at example article descriptions for some of these categories.
for category in set(df['category']):
    print('Category:', category)
    print(df[df['category'] == category]['description'].tolist()[0])
    print('--------------------')
Alright, let's get to the modeling! We'll start with $n$-gram analysis.
The method of $n$-gram analysis is one of the most common topic modeling techniques employed today. By $n$-gram analysis, I mean examining combinations of $n$ sequential words in a sentence. For example, the 1-grams (called unigrams) and 2-grams (called bigrams) of the sentence "this is an example of a sentence" would be:

unigrams: "this", "is", "an", "example", "of", "a", "sentence"
bigrams: "this is", "is an", "an example", "example of", "of a", "a sentence"
In the first case, the unigrams are just a list of all single words in the sentence. In the second case, the bigrams are all sequential two-word combinations.
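A minimal pure-Python sketch of this sliding-window idea looks like:

```python
def n_grams(tokens, n):
    """Return all runs of n consecutive tokens as joined strings."""
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

words = "this is an example of a sentence".split()
print(n_grams(words, 1))  # the unigrams: each word on its own
print(n_grams(words, 2))  # the bigrams: 'this is', 'is an', ...
```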
By looking at which $n$-grams occur most frequently in a document, we can get a sense for what is being talked about. In Python, the $n$-grams can be computed in two ways $-$ manually using a function that we define ourselves, or automatically using Scikit-Learn's CountVectorizer method. Let's try both!
Before the $n$-grams can be computed, we first have to process the text a little bit to remove punctuation and stop words. Stop words are words that occur so frequently in normal speech that they're of little interest to us, like "the", "and", etc. We define two functions below to perform these tasks $-$ they're called processText() and rmStopwords(). A third function called nGrams() takes in some text, tokenizes it (i.e., splits it into its individual words), and finds the $n$ sequential word combinations. In this manual implementation, we can also perform an additional task that is helpful in identifying interesting words: if we're finding unigrams, the function tags the part of speech of each word and keeps only nouns, which are more likely to be the subject of a piece of text.
punctuation = '!"#?$%“”&\'()’*+,—./:;<=>@[\\]^_`{|}~…'  # (removed - symbol)


def processText(text):
    # Lowercase, strip punctuation characters, and collapse extra spaces
    result = text.lower()
    result = ''.join(word for word in result
                     if word not in punctuation)
    result = re.sub(r' +', " ", result).strip()
    return result


def rmStopwords(text):
    # Drop any word appearing in the NLTK stop word list
    result = text.lower().split()
    result = ' '.join(word for word in result
                      if word not in stoplist)
    result = re.sub(r' +', " ", result).strip()
    return result


def nGrams(text, n):
    result = []
    text = text.split()
    if n == 1:
        # For unigrams, keep only nouns via part-of-speech tagging
        partspeech = nltk.pos_tag(text)
        result = [word for word, pos in
                  partspeech if pos[0] == 'N']
    else:
        for i in range(len(text)-(n-1)):
            result.append(' '.join(text[i:i+n]))
    if not result:
        result = np.nan
    return result
With these functions defined, we can now process the text and compute $n$-grams. In processing the text, we keep only those news articles where the description has more than 140 characters, as shorter descriptions sometimes don't contain a lot of useful information and generally add noise when performing topic modeling.
df['pro_descr'] = df['description'].apply(rmStopwords).apply(processText)
df['num_chars'] = df['pro_descr'].apply(len)
df = df[df['num_chars'] > 140].reset_index(drop=True)
df['ngram_1'] = df['pro_descr'].apply(nGrams, args=(1,))
df['ngram_2'] = df['pro_descr'].apply(nGrams, args=(2,))
df['ngram_3'] = df['pro_descr'].apply(nGrams, args=(3,))
Let's look at the unigrams, bigrams, and trigrams.
df[['description','ngram_1','ngram_2','ngram_3']].head()
We can now see which $n$-grams occur most frequently in each category.
def keywords_ngram(category, n, num):
    # Collect all n-grams for a category and count the most common ones
    text = df[df['category'] == category]['ngram_' + str(n)]
    result = []
    for word_list in text:
        result += word_list
    return Counter(result).most_common(num)
for category in set(df['category']):
    print("Category:", category)
    print("1-grams:", keywords_ngram(category, 1, 10))
    print('--------------------')
    print("2-grams:", keywords_ngram(category, 2, 10))
    print('--------------------')
    print("3-grams:", keywords_ngram(category, 3, 10))
    print('--------------------')
Now let's perform the same exercise, but using Scikit-Learn's CountVectorizer instead.
def keywords_cvec(category, n, num):
    text = df[df['category'] == category]['pro_descr']
    vector = CountVectorizer(min_df=1, strip_accents='unicode', analyzer='word',
                             token_pattern=r'\w{1,}', ngram_range=(n, n), stop_words=stoplist)
    cvec = vector.fit_transform(text)
    feature_names = vector.get_feature_names()
    feature_count = cvec.toarray().sum(axis=0)
    # Sort n-grams by total count, descending, and keep the top num
    sort_feature = sorted(zip(feature_names, feature_count), key=lambda x: x[1], reverse=True)[:num]
    return sort_feature
for category in set(df['category']):
    print("Category:", category)
    print("1-grams:", keywords_cvec(category, 1, 10))
    print('--------------------')
    print("2-grams:", keywords_cvec(category, 2, 10))
    print('--------------------')
    print("3-grams:", keywords_cvec(category, 3, 10))
    print('--------------------')
We see that in the business category, for example, many news articles tend to focus a great deal on Donald Trump, former FBI director James Comey, and former director of the Defense Intelligence Agency Michael Flynn. Let's plot the ten most frequent trigrams.
ngram = pd.DataFrame(keywords_ngram('general', 3, 10), columns=['ngram_3','counts'])
plot = ngram.iplot(kind="bar", x='ngram_3', y='counts', xTitle='Trigrams', color='blue',
yTitle='Counts', dimensions=(790,550), margin=(80,120,110,20))
Another common method used in topic modeling is called noun-phrase chunking. The idea is to find nouns and surrounding descriptive words in the text, as these phrases often carry the main point (i.e., the topic) of a sentence. I was first introduced to the idea of chunking from this super informative article about topic modeling.
To perform chunking, the first thing we have to do is define a "chunking pattern" (this is the string of symbols in the function argument). The chunking pattern tells the function which sort of phrases to look for in the text. In our case, it's one or more optional adjectives followed by one or more nouns.
def extract_candidate_chunks(text, grammar='CHUNK: {<JJ.*>*<NN.*>+}'):
    # Parse the tagged text and pull out phrases matching the chunk grammar
    parser = nltk.RegexpParser(grammar)
    tagged_sents = [nltk.pos_tag(nltk.word_tokenize(text))]
    candidates = []
    for chunk in tagged_sents:
        if not chunk:
            continue
        tree = parser.parse(chunk)
        for subtree in tree.subtrees():
            if subtree.label() == 'CHUNK':
                candidates.append(' '.join(word for (word, tag) in subtree.leaves()))
    candidates = [word for word in candidates if word not in stoplist]
    return candidates
Now we'll define another function to count common chunks.
def keyphrases_chunk(category, num):
    tokens = df[df['category'] == category]['chunk']
    alltokens = []
    for token_list in tokens:
        alltokens += token_list
    counter = Counter(alltokens)
    return counter.most_common(num)
Let's try it out!
df['chunk'] = df['pro_descr'].apply(extract_candidate_chunks)
for category in set(df['category']):
    print("Category:", category)
    print("chunk:", keyphrases_chunk(category, 15))
    print('--------------------')
We see that chunking also does a pretty good job in grabbing the main topics of the text. The results are similar to the mixture of unigrams, bigrams, and trigrams found using $n$-gram analysis, but they can also contain longer, more informative strings like "national geographic photographer charlie hamilton".
One very popular method used in clustering documents and in isolating "important" words within those documents is called term frequency$-$inverse document frequency (abbreviated tf-idf). The idea behind tf-idf is that a word is deemed "important" if it occurs frequently within a single document (i.e., a news article description), but is less important if it occurs many times across several documents. In this way, words common to many documents like "the", "and", etc. are given a lower tf-idf score and are therefore not very important.
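To make that intuition concrete, here is a toy pure-Python sketch on a three-document corpus. The smoothing convention below (adding 1 to the document count) is one common choice and differs slightly from Scikit-Learn's implementation.

```python
import math

docs = [d.split() for d in ["the cat sat", "the dog sat", "the dog barked"]]
N = len(docs)

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)    # term frequency within one document
    df = sum(term in d for d in docs)  # number of documents containing the term
    idf = math.log(N / (1 + df))       # inverse document frequency (smoothed)
    return tf * idf

# "the" occurs in every document, so it scores low;
# "barked" is rare, so it scores comparatively high.
print(tfidf('the', docs[2]), tfidf('barked', docs[2]))
```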
How can we represent tf-idf mathematically, though? To compute a tf-idf score of a term, the name itself suggests that we might count the number of times the term appears in a document (term frequency), and weight it by the number of times it occurs across all documents (inverse document frequency). Mathematically, this is written as
$$tfidf = tf(t,d) \times idf(t,D),$$

where $tf(t,d)$ is the frequency of some term $t$ in document $d$, and $idf$ is the inverse document frequency of term $t$ across all documents $D$, given as
$$idf(t,D) = \log\frac{\big|D\big|}{1+\big|\{d\in D:t\in d\}\big|}$$

Luckily for us, Python's Scikit-Learn has a built-in method for doing this called TfidfVectorizer. We now feed TfidfVectorizer the processed article descriptions (punctuation and stopwords removed).
tfidf = TfidfVectorizer(min_df=10, max_features=10000, strip_accents='unicode',
analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1,2),
use_idf=1, smooth_idf=1, sublinear_tf=1)
tfidf_descr = tfidf.fit_transform(df['pro_descr'])
tfidf_dict = dict(zip(tfidf.get_feature_names(), tfidf.idf_))
tfidf_df = pd.DataFrame(tfidf.idf_, index=tfidf.get_feature_names(), columns=['tfidf_val'])
tfidf_df.sort_values('tfidf_val', ascending=False, inplace=True)
Let's look at some of the unigrams and bigrams deemed most important through tf-idf.
tfidf_df.head(10)
Performing tf-idf has converted each article description into a 1,414-dimensional vector. For plotting purposes, let's reduce the dimensionality using truncated singular value decomposition and t-distributed stochastic neighbor embedding (TSNE). TSNE is a really cool technique for preserving "distances" between the tf-idf vectors while at the same time reducing their dimensionality. This is often useful in clustering.
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=50, random_state=0)
svd_tfidf = svd.fit_transform(tfidf_descr)
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, learning_rate=100)
tsne_tfidf = tsne_model.fit_transform(svd_tfidf)
tfidf_df = pd.DataFrame(tsne_tfidf, columns=['x', 'y'])
tfidf_df['description'] = df['description']
tfidf_df['category'] = df['category']
tfidf_df.iplot(kind="scatter", x='x', y='y', mode='markers', size=5, text='description',
dimensions=(790,650), margin=(50,50,40,40), color='blue')
We see that tf-idf does a fairly good job of clustering similar topics. For example there are clusters which discuss Donald Trump and James Comey, a cluster discussing sports, one discussing technology, etc.
Another technique sometimes used in topic modeling to group similar documents is called k$-$means clustering. The idea is that given a set of feature vectors describing various documents like those plotted above, we can use an algorithm to identify any clusters present, and also identify common key words shared between the documents in those clusters. For this work, we'll assume there are 15 clusters present.
from sklearn.cluster import MiniBatchKMeans
n_clusters = 15
kmeans_model = MiniBatchKMeans(n_clusters=n_clusters, init='k-means++', n_init=100,
batch_size=100, verbose=False, max_iter=1000, random_state=3)
kmeans = kmeans_model.fit(tfidf_descr)
kmeans_clusters = kmeans.predict(tfidf_descr)
kmeans_distances = kmeans.transform(tfidf_descr)
Let's look at the common unigrams and bigrams shared between documents in each cluster.
sorted_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = tfidf.get_feature_names()
for i in range(n_clusters):
print("Cluster %d:" % i)
for ind in sorted_centroids[i, :10]:
print(terms[ind], end=' / ')
print('\n', '--------------------')
We can see that one cluster is comprised of articles discussing a relationship between the United States, Turkey, and Syria, while another focuses on James Comey and Donald Trump, and yet another is associated with Premier League soccer. We can make another plot where the various identified clusters are given different colors.
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, learning_rate=1500)
tsne_kmeans = tsne_model.fit_transform(kmeans_distances)
df_kmeans = pd.DataFrame(tsne_kmeans, columns=['x', 'y'])
df_kmeans['cluster'] = kmeans_clusters
df_kmeans['description'] = df['description']
df_kmeans['category'] = df['category']
colormap = ["#6d8dca", "#69de53", "#723bca", "#c3e14c", "#c84dc9",
"#68af4e", "#6e6cd5", "#e3be38", "#4e2d7c", "#5fdfa8",
"#d34690", "#3f6d31", "#d44427", "#7fcdd8", "#cb4053"]
color_dict = dict(zip(range(0,n_clusters), colormap))
df_kmeans['color'] = df_kmeans['cluster'].map(color_dict)
import plotly.graph_objs as go
trace = go.Scatter(
    x=df_kmeans['x'],
    y=df_kmeans['y'],
    mode='markers',
    text=df_kmeans['description'],
    marker=dict(size=5, color=df_kmeans['color'].tolist())
)
df_kmeans.iplot([trace], dimensions=(790,650), margin=(50,50,40,40))
Like tf-idf clustering, k$-$means clustering does a good job of separating topics, and the figure indeed shows roughly 15 distinct clusters.
Lastly, let's try out another popular technique used in topic modeling called latent Dirichlet allocation (LDA). LDA is a statistical method that assumes documents are composed of a mixture of $n$ topics, and that the words in those documents are all attributed in some way to those topics. A detailed explanation of LDA can be found here. To perform LDA, we can make use of the lda package found here. Performing LDA can be a bit tricky though, as the results can vary dramatically depending on our choice of the number of topics to identify, how noisy the documents are, etc. Let's see how we do with our news articles!
from lda import LDA
import logging
logging.getLogger("lda").setLevel(logging.WARNING)
n_topics = 15
vector = CountVectorizer(min_df=5, max_features=10000, strip_accents='unicode', analyzer='word',
token_pattern=r'\w{1,}', ngram_range=(1,2))
cvec = vector.fit_transform(df['pro_descr'])
model = LDA(n_topics=n_topics, n_iter=2500, random_state=0)
topics = model.fit_transform(cvec)
n_top_words = 8
topic_word = model.topic_word_
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vector.get_feature_names())[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Again, we see that there are articles discussing Donald Trump and James Comey, some discussing Premier League soccer, several about Google and new technology, etc. Let's plot the clusters.
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, learning_rate=3000)
tsne_lda = tsne_model.fit_transform(topics)
df_lda = pd.DataFrame(tsne_lda, columns=['x', 'y'])
df_lda['cluster'] = topics.argmax(axis=1)  # most probable LDA topic for each document
df_lda['description'] = df['description']
df_lda['category'] = df['category']
colormap = ["#6d8dca", "#69de53", "#723bca", "#c3e14c", "#c84dc9",
"#68af4e", "#6e6cd5", "#e3be38", "#4e2d7c", "#5fdfa8",
"#d34690", "#3f6d31", "#d44427", "#7fcdd8", "#cb4053"]
color_dict = dict(zip(range(0,n_topics), colormap))
df_lda['color'] = df_lda['cluster'].map(color_dict)
import plotly.graph_objs as go
trace = go.Scatter(
    x=df_lda['x'],
    y=df_lda['y'],
    mode='markers',
    text=df_lda['description'],
    marker=dict(size=5, color=df_lda['color'].tolist())
)
df_lda.iplot([trace], dimensions=(790,650), margin=(50,50,40,40))
In this project we detailed a number of popular techniques used in topic modeling and document clustering today. These include methods like $n$-gram analysis, part-of-speech chunking, tf-idf, k$-$means clustering, and latent Dirichlet allocation. Interestingly, these all happen to fall under the category of "unsupervised" learning techniques, in that the algorithms process text blindly without having any a priori knowledge of what constitutes a legitimate topic. However, it is also possible to perform topic modeling using "supervised" learning methods, in which algorithms first learn what "good" topics look like and then go on to identify them in new, unseen text. Maybe I'll leave that for a future post!
Well, that’s all I have for now. Thanks for following along!